This kernel covers two parts, and both follow the same two-step process: NLP feature extraction followed by KNN model fitting. The first part analyzes the text of the questions, while the second part uses the text of the answers.
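As a rough, hedged outline of that two-step process (the toy strings and labels below are placeholders, not data from this kernel), TF-IDF vectorization feeds into a KNN model:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# placeholder texts and labels, only to illustrate the two steps
sample_texts = ["How do I read a CSV file in pandas?",
                "Use pd.read_csv with the path to the file."]
sample_labels = [0, 1]

tfidf = TfidfVectorizer()                  # step 1: NLP (TF-IDF features)
X = tfidf.fit_transform(sample_texts)

knn = KNeighborsClassifier(n_neighbors=1)  # step 2: KNN model fitting
knn.fit(X, sample_labels)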
In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud,STOPWORDS
Questions=pd.read_csv('./Questions.csv',encoding = 'iso-8859-1')
Answers=pd.read_csv('./Answers.csv',encoding = 'iso-8859-1')
In [2]:
User_id_inQ= Questions['OwnerUserId'].unique()
User_id_inA= Answers['OwnerUserId'].unique()
In [3]:
All_id=set(User_id_inQ).intersection(User_id_inA)
In [4]:
print('So we have '+str(len(All_id))+ \
' users who post both questions and answers on Stack Overflow')
In [5]:
users=pd.DataFrame({'idUser':list(All_id)})
In [6]:
users.head()
Out[6]:
In [7]:
# total activity per user: number of questions plus number of answers posted
users['Quantity']=users['idUser'].apply(lambda x: \
len(Questions[Questions['OwnerUserId']==x]['Body']) \
+len(Answers[Answers['OwnerUserId']==x]['Body']))
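The apply above scans both DataFrames once per user, which gets slow with tens of thousands of users. A minimal alternative sketch, assuming the goal is only the per-user post counts (the names q_counts and a_counts are my own), precomputes the totals with value_counts:

# assumption: produces the same counts as above, computed once instead of per user
q_counts = Questions['OwnerUserId'].value_counts()
a_counts = Answers['OwnerUserId'].value_counts()
users['Quantity'] = users['idUser'].map(lambda x: q_counts.get(x, 0) + a_counts.get(x, 0))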
In [8]:
users.head()
Out[8]:
In [9]:
users_final=users.sort_values(['Quantity'],ascending=False).reset_index(drop=True)
users_final.head()
Out[9]:
In [10]:
users_final=users_final.iloc[0:10000,]
users_final.shape
Out[10]:
In [11]:
All_id=list(users_final['idUser'])
First, clean the body text of the questions and answers: strip out the code blocks and HTML tags so that only the plain text of each post remains. Only the main body text of each post will be used.
In [12]:
# remove the code blocks from the question bodies
body = Questions['Body'].str.replace(r'<code>[^<]+</code>',' ',regex=True)
# strip the remaining HTML tags and line breaks, keeping only the plain text
Questions['QuestionBody'] = body.str.replace(r"<[^>]+>|\n|\r", " ",regex=True)
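To see concretely what these two substitutions do, here is a small standalone example on a made-up snippet (the HTML string below is an assumption, not a row from the dataset):

import re

sample = "<p>How do I loop over a list?</p>\n<code>for x in items: print(x)</code>"
no_code = re.sub(r'<code>[^<]+</code>', ' ', sample)  # drop the code block
plain = re.sub(r'<[^>]+>|\n|\r', ' ', no_code)        # drop tags and line breaks
print(plain)  # only the plain question text remains, padded with spaces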
In [13]:
Questions.head()
Out[13]:
In [15]:
# remove the code blocks from the answer bodies
body = Answers['Body'].str.replace(r'<code>[^<]+</code>',' ',regex=True)
# strip the remaining HTML tags and line breaks; the column keeps the name
# QuestionBody so the same downstream code works for questions and answers
Answers['QuestionBody'] = body.str.replace(r"<[^>]+>|\n|\r", " ",regex=True)
In [16]:
Answers.head()
Out[16]:
In [17]:
# keep OwnerUserId and the cleaned text, restricted to the top 10,000 users
Q_data=Questions[['OwnerUserId','QuestionBody']]
A_data=Answers[['OwnerUserId','QuestionBody']]
Question=Q_data[Q_data['OwnerUserId'].isin(All_id)]
Answer=A_data[A_data['OwnerUserId'].isin(All_id)]
In [25]:
Question.head()
Out[25]:
In [26]:
Answer.head()
Out[26]:
In [29]:
Answer.shape
Out[29]:
In [32]:
Answer['QuestionBody'][6]
Out[32]:
In [35]:
type(Question.QuestionBody)
Out[35]:
In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
Q_features=tfidf.fit_transform(Question.QuestionBody)
A_features=tfidf.fit_transform(Answer.QuestionBody)
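Note that calling fit_transform twice with the same TfidfVectorizer refits the vocabulary on the answers, so Q_features and A_features end up in different feature spaces. If a later step needs question and answer vectors with matching columns, one option (a sketch under that assumption, not necessarily what this kernel requires) is to fit the vocabulary once on the combined text and only transform each corpus:

# assumption: a shared vocabulary is wanted so the two matrices are comparable
shared_tfidf = TfidfVectorizer()
shared_tfidf.fit(pd.concat([Question.QuestionBody, Answer.QuestionBody]))
Q_shared = shared_tfidf.transform(Question.QuestionBody)
A_shared = shared_tfidf.transform(Answer.QuestionBody)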
In [36]:
type(Q_features)
Out[36]:
In [37]:
Q_features
Out[37]:
In [38]:
A_features
Out[38]:
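As a hedged preview of the KNN step announced at the top of the kernel (the exact model and target used later may differ; this sketch only shows that the sparse TF-IDF matrix can be fed to a nearest-neighbour model directly):

from sklearn.neighbors import NearestNeighbors

# unsupervised nearest-neighbour index over the question vectors
nn = NearestNeighbors(n_neighbors=5, metric='cosine')
nn.fit(Q_features)
# distances and indices of the 5 questions most similar to the first one
dist, idx = nn.kneighbors(Q_features[0])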